Abstract We tackle the problem of learning linear classifiers from noisy datasets
in a multiclass setting. The two-class version of this problem was studied a few
years ago; the approaches proposed there to combat the noise revolve around a
Perceptron learning scheme fed with peculiar examples computed through a weighted
average of points from the noisy training set. We build upon these approaches
and introduce a new algorithm called UMA (for Unconfused Multiclass
additive Algorithm), which may be seen as a generalization of the previous
approaches to the multiclass setting. To characterize the noise, we use the confusion
matrix as a multiclass extension of the classification noise studied in the
aforementioned literature. Theoretically well-founded, UMA furthermore displays very good
empirical noise robustness, as evidenced by numerical simulations conducted on
both synthetic and real data.

Keywords Multiclass classification • Perceptron • Noisy labels • Confusion 
Matrix • Ultraconservative algorithms 


1 Introduction 

Context. This paper deals with linear multiclass classification problems defined on
an input space $\mathcal{X}$ (e.g., $\mathcal{X} = \mathbb{R}^d$) and a set of classes

$\mathcal{Q} = \{1, \ldots, Q\}$.


In particular, we are interested in establishing the robustness of ultraconservative
additive algorithms [10] to label noise in the multiclass setting (to
lighten notation, we will from now on refer to these algorithms simply as
ultraconservative algorithms). We study whether it is possible to learn a linear predictor from a
training set made of independent realizations of a pair $(X, Y)$ of random variables:

$S = \{(x_i, y_i)\}_{i=1}^n$,

where $y_i \in \mathcal{Q}$ is a corrupted version of a true, i.e. deterministically computed,
class $t(x_i) \in \mathcal{Q}$ associated with $x_i$ according to some concept $t$. The random noise
process $Y$ that corrupts the label to provide the $y_i$'s given the $x_i$'s is supposed
uniform within each pair of classes; it is thus fully described by a confusion matrix
$C = (C_{pq})_{p,q \in \mathcal{Q}}$ so that

$\forall x, \; C_{p\,t(x)} = \mathbb{P}_Y(Y = p \mid x)$.

The goal that we would like to achieve is to provide a learning procedure able to
deal with the confusion noise present in the training set $S$ and to give rise to a classifier
$h$ with small risk

$R(h) = \mathbb{P}_{X \sim \mathcal{D}}(h(X) \neq t(X))$,

$\mathcal{D}$ being the distribution according to which the $x_i$'s are obtained. As we want to
recover from the confusion noise, i.e., we want to achieve low risk on uncorrupted (non-noisy)
data, we use the term unconfused to characterize the procedures we propose.

Ultraconservative learning procedures are online learning algorithms that output
linear classifiers. They display nice theoretical properties regarding their
convergence in the case of linearly separable datasets, provided a sufficient separation
margin is guaranteed (as formalized in Assumption 1 below). In turn, these
convergence-related properties yield generalization guarantees about the quality
of the predictor learned. We build upon these convergence properties to show
that ultraconservative algorithms are robust to a confusion noise process, provided
that: i) $C$ is invertible and can be accessed, ii) the original dataset $\{(x_i, t(x_i))\}_{i=1}^n$
is linearly separable. This paper is essentially devoted to proving how and why
ultraconservative multiclass algorithms are indeed robust to such situations. To some
extent, the results provided in the present contribution may be viewed as a
generalization of the contributions on learning binary perceptrons under misclassification
noise [6,7].

Besides the theoretical questions raised by the learning setting considered, we
may depict the following example of an actual learning scenario where learning
from noisy data is relevant. This learning scenario will be further investigated
from an empirical standpoint in the section devoted to numerical simulations
(Section 4).

Example 1 One situation where coping with mislabelled data is required arises in
(partially supervised) scenarios where labelling data is very expensive. Imagine a
task of text categorization from a training set $S = S_l \cup S_u$, where $S_l = \{(x_i, y_i)\}_{i=1}^n$
is a set of $n$ labelled training examples and $S_u = \{x_{n+i}\}_{i=1}^m$ is a set of $m$ unlabelled
vectors; in order to fall back to realistic training scenarios where more labelled data
cannot be acquired, we may assume that $n \ll m$. A possible three-stage strategy
to learn a predictor is as follows: first, learn a predictor $f_l$ on $S_l$ and estimate its
confusion error $C$ via a cross-validation procedure ($f_l$ is assumed to make mistakes
evenly over the class regions); second, use the learned predictor to label all the
data in $S_u$ to produce the labelled training set $S = \{(x_{n+i}, t_{n+i} := f_l(x_{n+i}))\}_{i=1}^m$;
and finally, learn a classifier $f$ from $S$ and the confusion information $C$.
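
The confusion-estimation stage of Example 1 can be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_confusion`, the 0-based class indices and the use of plain held-out predictions are hypothetical choices, not the authors' procedure. The matrix follows the convention used in the text, with entry (p, q) estimating the probability that an example of true class q receives label p.

```python
from typing import List, Sequence


def estimate_confusion(true_labels: Sequence[int],
                       predicted_labels: Sequence[int],
                       num_classes: int) -> List[List[float]]:
    """Estimate C[p][q] ~ P(predicted = p | true = q) from held-out pairs.

    Classes are assumed to be 0-based integers (an assumption of this sketch).
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[p][t] += 1
    # Normalize each column q by the number of examples of true class q.
    col_totals = [sum(counts[p][q] for p in range(num_classes))
                  for q in range(num_classes)]
    return [[counts[p][q] / col_totals[q] if col_totals[q] else 0.0
             for q in range(num_classes)]
            for p in range(num_classes)]
```

In a cross-validation setting, the held-out predictions of every fold would be pooled before calling such a function.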

This introductory example pertains to semi-supervised learning and this is only
one possible learning scenario where the contribution we propose, UMA, might be 
of some use. Still, it is essential to understand right away that one key feature 
of UMA, which sets it apart from many contributions encountered in the realm of 
semi-supervised learning, is that we do provide theoretical bounds on the sample 
complexity and running time required by our algorithm to output an effective 
predictor. 

The present paper is an extended version of [21]. Compared with the original 
paper, it provides a more detailed introduction of the tools used in the paper, 
a more thorough discussion on related work as well as more extensive numerical 
results (which confirm the relevance of our findings). A strategy to make use of 
kernels for nonlinear classification has also been added. 

Contributions. Our main contribution is to show that it is both practically and
theoretically possible to learn a multiclass classifier on noisy data if some
information on the noise process is available. We propose a way to generate new points for
which the true class is known. Hence we can iteratively populate a new unconfused
dataset to learn from. This allows us to handle a massive amount of mislabelled
data with only a very slight loss of accuracy. We embed our method into
ultraconservative algorithms and provide a thorough analysis of it, in which we show that
the strong theoretical guarantees that characterize the family of ultraconservative
algorithms carry over to the noisy scenario.

Related Work. Learning from mislabelled data in an iterative manner has a
longstanding history in the machine learning community. The first contributions on
this topic, based on the Perceptron algorithm [22], are those of [7,6,8], which
promoted the idea utilized here that a sample average may be used to construct
update vectors relevant to a Perceptron learning procedure. These first
contributions were focused on the binary classification case and, for [6,8], tackled the
specific problem of strong-polynomiality of the learning procedure in the probably
approximately correct (PAC) framework [20]. Later, [27] proposed a binary
learning procedure making it possible to learn a kernel Perceptron in a noisy setting;
an interesting feature of this work is the recourse to random projections in
order to lower the capacity of the class of kernel-based classifiers. Meanwhile, many
advances were witnessed in the realm of online multiclass learning procedures.
In particular, [10] proposed families of learning procedures subsuming the
Perceptron algorithm, dedicated to tackling multiclass prediction problems. A sibling
family of algorithms, the passive-aggressive online learning algorithms [9], inspired
both by the previous family and the idea of minimizing instantaneous losses, was
designed to tackle various problems, among which multiclass linear classification.
Sometimes, learning with partially labelled data might be viewed as a problem of
learning with corrupted data (if, for example, all the unlabelled data are randomly
or arbitrarily labelled) and it makes sense to mention the works [19] and [25] as
distant relatives to the present work.

Organization of the paper. Section 2 formally states the setting we consider
throughout this paper. Section 3 provides the details of our main contribution: the UMA
algorithm and its detailed theoretical analysis. Section 4 presents numerical
simulations that support the soundness of our approach.

2 Setting and Problem

2.1 Noisy Labels with Underlying Linear Concept 

The probabilistic setting we consider hinges on the existence of two components.
On the one hand, we assume an unknown (but fixed) probability distribution $\mathcal{D}$
on the input space $\mathcal{X} = \mathbb{R}^d$. On the other hand, we also assume the existence of a
deterministic labelling function $t : \mathcal{X} \to \mathcal{Q}$, where $\mathcal{Q} = \{1, \ldots, Q\}$, which associates
a label $t(x)$ to any input example $x$; in the Probably Approximately Correct (PAC)
literature, $t$ is sometimes referred to as a concept [20,29].

In the present paper, we focus on learning linear classifiers, defined as follows.

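Definition 1 (Linear classifiers) The linear classifier $f_W : \mathcal{X} \to \mathcal{Q}$ is a classifier
that is associated with a set of vectors $W = [w_1 \cdots w_Q] \in \mathbb{R}^{d \times Q}$ and which predicts
the label $f_W(x)$ of any vector $x \in \mathcal{X}$ as

$f_W(x) = \operatorname{argmax}_{q \in \mathcal{Q}} \langle w_q, x \rangle$. (1)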
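
As a minimal sketch of decision rule (1), the snippet below stores $W$ as a list of $Q$ weight vectors and returns the index of the largest inner product. The names `dot` and `predict`, the 0-based class indices and the tie-breaking rule (lowest index wins) are assumptions made for illustration; the text does not specify how ties are resolved.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def predict(W: Sequence[Sequence[float]], x: Sequence[float]) -> int:
    """Return argmax_q <w_q, x>; ties go to the lowest class index."""
    scores = [dot(w, x) for w in W]
    return max(range(len(W)), key=lambda q: scores[q])
```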

Additionally, without loss of generality, we suppose that

$\mathbb{P}_{X \sim \mathcal{D}}(\|X\| = 1) = 1$,

where $\|\cdot\|$ is the Euclidean norm. This allows us to introduce the notion of margin.


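Definition 2 (Margin of a linear classifier) Let $c : \mathcal{X} \to \mathcal{Q}$ be some fixed concept.
Let $W = [w_1 \cdots w_Q] \in \mathbb{R}^{d \times Q}$ be a set of $Q$ weight vectors. Linear classifier $f_W$ is
said to have margin $\theta > 0$ with respect to $c$ (and distribution $\mathcal{D}$) if the following
holds:

$\mathbb{P}_{X \sim \mathcal{D}}\{\exists p \neq c(X) : \langle w_{c(X)} - w_p, X \rangle \le \theta\} = 0$.

Note that if $f_W$ has margin $\theta > 0$ with respect to $c$ then

$\mathbb{P}_{X \sim \mathcal{D}}(f_W(X) \neq c(X)) = 0$.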

Equipped with this definition, we shall consider that the following assumption of
linear separability with margin $\theta$ of concept $t$ holds throughout.

Assumption 1 (Linear Separability of $t$ with Margin $\theta$.) There exist $\theta > 0$ and
$W^* = [w_1^* \cdots w_Q^*] \in \mathbb{R}^{d \times Q}$ with $\|W^*\|_F^2 = 1$ ($\|\cdot\|_F$ denotes the Frobenius norm) such
that $f_{W^*}$ has margin $\theta$ with respect to the concept $t$.

In a conventional setting, one would be asked to learn a classifier $f$ from a
training set

$S_{\text{true}} = \{(x_i, t(x_i))\}_{i=1}^n$

made of $n$ labelled pairs from $\mathcal{X} \times \mathcal{Q}$ such that the $x_i$'s are independent
realizations of a random variable $X$ distributed according to $\mathcal{D}$, with the objective of
minimizing the true risk or misclassification error $R_{\text{error}}(f)$ of $f$ given by

$R_{\text{error}}(f) = \mathbb{P}_{X \sim \mathcal{D}}(f(X) \neq t(X))$. (2)

In other words, the objective is for $f$ to have a prediction behavior as close as
possible to that of $t$. As announced in the introduction, there is however a
twist in the problem that we are going to tackle. Instead of having direct access
to $S_{\text{true}}$, we assume that we only have access to a corrupted version

$S = \{(x_i, y_i)\}_{i=1}^n$ (3)

where each $y_i$ is the realization of a random variable $Y$ whose distribution agrees
with the following assumption:

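Assumption 2 The law $\mathcal{D}_{Y|X}$ of $Y$ is the same for all $x \in \mathcal{X}$ and its conditional
distribution is fully summarized into a known confusion matrix $C$ given by

$\forall x, \; C_{p\,t(x)} = \mathbb{P}_Y(Y = p \mid X = x) = \mathbb{P}_{Y \sim \mathcal{D}_{Y|X=x}}(Y = p \mid t(x) = q)$, where $q = t(x)$. (4)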

Alternatively put, the noise process that corrupts the data is uniform within
each class and its level does not depend on the precise location of $x$ within the
region that corresponds to class $t(x)$. The noise process $Y$ is both a) aggressive, as
it does not only apply, as we may expect, to regions close to the class boundaries,
and b) regular, as the mislabelling rate is piecewise constant.
Nonetheless, this setting can account for many real-world problems as numerous
noisy phenomena can be summarized by a simple confusion matrix. Moreover, it
has been proved [6] that robustness to classification noise generalizes robustness to
monotonic noise where, for each class, the noise rate is a monotonically decreasing
function of the distance to the class boundaries.
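
To make the noise model concrete, the following sketch draws a noisy label for each example from the column of $C$ indexed by its true class, so that the corruption depends on $t(x)$ only and not on the location of $x$. The function name `corrupt_labels` and the 0-based class indices are illustrative assumptions; each column of `C` is assumed to sum to one.

```python
import random
from typing import List, Sequence


def corrupt_labels(true_labels: Sequence[int],
                   C: Sequence[Sequence[float]],
                   rng: random.Random) -> List[int]:
    """Sample y_i with P(y_i = p | t(x_i) = q) = C[p][q] (0-based classes)."""
    Q = len(C)
    noisy = []
    for t in true_labels:
        column = [C[p][t] for p in range(Q)]  # distribution of Y for true class t
        noisy.append(rng.choices(range(Q), weights=column, k=1)[0])
    return noisy
```

With the identity matrix the labels are returned unchanged, while off-diagonal mass in column $q$ controls how often class $q$ is mislabelled.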

Remark 1 The confusion matrix $C$ should not be mistaken for the matrix $\tilde{C}$ of
general term $\tilde{C}_{ij} = \mathbb{P}(t(X) = i \mid Y = j)$, which is the class-conditional
distribution of $t(X)$ given $Y$. The problem of learning from a noisy training set
and $\tilde{C}$ is a different problem from the one we aim to solve. In particular, $\tilde{C}$ can be
used to define cost-sensitive losses rather directly whereas doing so with $C$ is far
less obvious. Anyhow, this second problem of learning from $\tilde{C}$ is far from trivial
and very interesting, and it falls beyond the scope of the present work.

Finally, we assume the following from here on: 

Assumption 3 C is invertible. 

Note that this assumption is not as restrictive as it may appear. For instance, if
we consider the learning setting depicted in Example 1 and implemented in the
numerical simulations, then the confusion matrix obtained from the first predictor
is often strictly diagonally dominant, i.e. the magnitude of each diagonal entry is
larger than the sum of the magnitudes of the other entries in its row,
and $C$ is therefore invertible. Generally speaking, the problems that we are
interested in (i.e. problems where the true classes seem to be recoverable) tend to have
invertible confusion matrices. It is most likely that invertibility is merely a sufficient
condition on $C$ that allows us to establish learnability in the sequel. Identifying less
stringent conditions on $C$, or conditions stated in a different way (based, for
instance, on the condition number of $C$) for learnability to remain, is
a research issue of its own that we leave for future investigations.
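
The diagonal-dominance argument can be checked numerically before running anything that requires $C^{-1}$. The helper below is a sketch with a hypothetical name; it tests strict diagonal dominance by rows, which is a sufficient (not necessary) condition for invertibility.

```python
from typing import Sequence


def is_strictly_diagonally_dominant(C: Sequence[Sequence[float]]) -> bool:
    """True if |C[i][i]| > sum over j != i of |C[i][j]| for every row i."""
    for i, row in enumerate(C):
        off_diagonal = sum(abs(v) for j, v in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True
```

A matrix that fails this test may still be invertible, so a failed check only means that invertibility has to be established by other means.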

The setting we have just presented allows us to view $S = \{(x_i, y_i)\}_{i=1}^n$ as
the realization of a random sample $\{(X_i, Y_i)\}_{i=1}^n$, where each pair $(X_i, Y_i)$ is an
independent copy of the random pair $(X, Y)$ of law $\mathcal{D}_{XY} = \mathcal{D}_X \mathcal{D}_{Y|X}$.





Ugo Louche, Liva Ralaivola 


2.2 Problem: Learning a Linear Classifier from Noisy Data 

The problem we address is the learning of a classifier $f$ from $S$ and $C$ so that the
error rate

$R_{\text{error}}(f) = \mathbb{P}_{X \sim \mathcal{D}}(f(X) \neq t(X))$

of $f$ is as small as possible: the usual goal of learning a classifier $f$ with small risk
is preserved, while now the training data is only made of corrupted labelled pairs.

Building on Assumption 1, we may refine our learning objective by restricting
ourselves to linear classifiers $f_W$, for $W = [w_1 \cdots w_Q] \in \mathbb{R}^{d \times Q}$ (see Definition 1).

Our goal is thus to learn a relevant matrix $W$ from $S$ and the confusion matrix
$C$. More precisely, we achieve risk minimization through classic additive methods
and the core of this work is focused on computing noise-free update points such
that the properties of said methods are unchanged.


3 UMA: Unconfused Ultraconservative Multiclass Algorithm

This section presents the main result of the paper, that is, the UMA procedure,
which is a response to the problem posed above: UMA makes it possible to learn
a multiclass linear predictor from $S$ and the confusion information $C$. In
addition to the algorithm itself, this section provides theoretical results regarding the
convergence and sample complexity of UMA.

As UMA is a generalization of the ultraconservative additive online algorithms 
proposed in [10] to the case of noisy labels, we first and foremost recall the essential 
features of this family of algorithms. The rest of the section is then devoted to the 
presentation and analysis of UMA. 


3.1 A Brief Reminder on Ultraconservative Additive Algorithms 

Ultraconservative additive online algorithms were introduced by Crammer et al. in
[10]. As already stated, these algorithms output multiclass linear predictors $f_W$ as
in Definition 1 and their purpose is therefore to compute a set $W = [w_1 \cdots w_Q] \in
\mathbb{R}^{d \times Q}$ of $Q$ weight vectors from some training sample $S_{\text{true}} = \{(x_i, t(x_i))\}_{i=1}^n$. To
do so, they implement the procedure depicted in Algorithm 1, which centrally
revolves around the identification of an error set and its simple update: when
processing a training pair $(x, y)$, they perform updates of the form

$w_q \leftarrow w_q + \tau_q x, \quad q = 1, \ldots, Q$,

whenever the error set $\mathcal{E}(x, y)$ defined as

$\mathcal{E}(x, y) = \{r \in \mathcal{Q} \setminus \{y\} : \langle w_r, x \rangle - \langle w_y, x \rangle \ge 0\}$ (5)

is not empty, with the constraint for the family $\{\tau_q\}_{q \in \mathcal{Q}}$ of step sizes to fulfill

$\begin{cases} \tau_y = 1 \\ \tau_r \le 0, & \text{if } r \in \mathcal{E}(x, y) \\ \tau_r = 0, & \text{otherwise} \end{cases} \quad \text{and} \quad \sum_{r=1}^{Q} \tau_r = 0$. (6)

Algorithm 1 Ultraconservative Additive algorithms [10].

Input: $S_{\text{true}}$
Output: $W = [w_1, \ldots, w_Q]$ and associated classifier $f_W(\cdot) = \operatorname{argmax}_q \langle w_q, \cdot \rangle$

Initialization: $w_q \leftarrow 0, \forall q \in \mathcal{Q}$
repeat
    access training pair $(x_t, y_t)$
    compute the error set $\mathcal{E}(x_t, y_t)$ according to (5)
    if $\mathcal{E}(x_t, y_t) \neq \emptyset$ then
        compute a set $\{\tau_q\}_{q \in \mathcal{Q}}$ of update steps that comply with (6)
        perform the updates $w_q \leftarrow w_q + \tau_q x_t, \forall q \in \mathcal{Q}$
    end if
until some stopping criterion is met
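
One admissible instance of Algorithm 1 is sketched below. It assumes uniform step sizes, i.e. $\tau_y = 1$ and $\tau_r = -1/|\mathcal{E}|$ for every $r$ in the error set, which satisfies constraint (6); this particular choice, the function names and the 0-based labels are illustrative assumptions, since the algorithm leaves the step sizes free within (6).

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def ultraconservative_pass(W: List[List[float]],
                           data: Sequence[Tuple[Sequence[float], int]]) -> int:
    """Run one pass over `data`, updating W in place; return the update count."""
    updates = 0
    Q = len(W)
    for x, y in data:
        score_y = dot(W[y], x)
        # Error set (5): classes scoring at least as high as the target class.
        error_set = [r for r in range(Q) if r != y and dot(W[r], x) - score_y >= 0]
        if not error_set:
            continue
        # Uniform step sizes satisfying (6): tau_y = 1, others share -1 equally.
        tau = [0.0] * Q
        tau[y] = 1.0
        for r in error_set:
            tau[r] = -1.0 / len(error_set)
        for q in range(Q):
            W[q] = [w + tau[q] * xi for w, xi in zip(W[q], x)]
        updates += 1
    return updates
```

Because the step sizes sum to zero at every update, the weight vectors also sum to zero after any number of passes when they are initialized at zero.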


The term ultraconservative refers to the fact that only those prototype vectors $w_r$
which achieve a larger inner product $\langle w_r, x \rangle$ than $\langle w_y, x \rangle$, that is, the vectors that
can entail a prediction mistake when decision rule (1) is applied, may be affected by
the update procedure. The term additive conveys the fact that the updates consist
in modifying the weight vectors $w_r$ by adding a portion of $x$ to them (as
opposed to multiplicative update schemes). Again, as we only consider these
additive types of updates in what follows, this will be implicitly understood
even when not explicitly mentioned.

One of the main results regarding ultraconservative algorithms, which we
extend to our learning scenario, is the following.

Theorem 1 (Mistake bound for ultraconservative algorithms [10].) Suppose
that concept $t$ is in accordance with Assumption 1. The number of mistakes/updates
made by one pass over $S$ by any ultraconservative procedure is upper-bounded by $2/\theta^2$.

This result is essentially a generalization of the well-known Block-Novikoff
theorem [5,23], which establishes a mistake bound for the Perceptron algorithm (an
ultraconservative algorithm itself).


3.2 Main Result and High Level Justification 

This section presents our main contribution, UMA, a theoretically grounded
noise-tolerant multiclass algorithm depicted in Algorithm 2. UMA learns and outputs
a matrix $W = [w_1 \cdots w_Q] \in \mathbb{R}^{d \times Q}$ from a noisy training set $S$ to produce the
associated linear classifier

$f_W(\cdot) = \operatorname{argmax}_q \langle w_q, \cdot \rangle$ (7)

by iteratively updating the $w_q$'s, whilst maintaining $\sum_q w_q = 0$ throughout the
learning process. As UMA is a new member of the family of multiclass additive algorithms, we may
readily recognize in step 8 through step 10 of Algorithm 2 the generic step sizes $\{\tau_q\}_{q \in \mathcal{Q}}$
promoted by ultraconservative algorithms (see Algorithm 1). An important feature
of UMA is that it only uses information provided by $S$ and makes no assumption
on the accessibility of the noise-free dataset $S_{\text{true}}$: the pivotal difference
with regular ultraconservative algorithms is that the update points used are now
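
Algorithm 2 UMA: Unconfused Ultraconservative Multiclass Algorithm.

Input: $S = \{(x_i, y_i)\}_{i=1}^n$, confusion matrix $C \in \mathbb{R}^{Q \times Q}$, and $\alpha > 0$
Output: $W = [w_1, \ldots, w_Q]$ and classifier $f_W(\cdot) = \operatorname{argmax}_q \langle w_q, \cdot \rangle$

1: $w_k \leftarrow 0, \forall k \in \mathcal{Q}$
2: repeat
3:   select $p$ and $q$
4:   compute set $A_p^\alpha$ as
       $A_p^\alpha \leftarrow \{x \mid x \in S, \langle w_p, x \rangle - \langle w_k, x \rangle \ge \alpha, \forall k \neq p\}$
5:   for $k = 1, \ldots, Q$, compute $\gamma_k^p$ as
       $\gamma_k^p \leftarrow \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{y_i = k\}\mathbb{I}\{x_i \in A_p^\alpha\} x_i^\top, \forall k \in \mathcal{Q}$
6:   form $\Gamma^p \in \mathbb{R}^{Q \times d}$ as
       $\Gamma^p \leftarrow [\gamma_1^p; \ldots; \gamma_Q^p]$ (the $\gamma_k^p$ stacked as rows)
7:   compute the update vector $z_{pq}$ according to ($[M]_q$ refers to the $q$th row of matrix $M$)
       $z_{pq} \leftarrow [C^{-1} \Gamma^p]_q$
8:   compute the error set $\mathcal{E}^\alpha(z_{pq}, q)$ as
       $\mathcal{E}^\alpha(z_{pq}, q) \leftarrow \{r \in \mathcal{Q} \setminus \{q\} : \langle w_r, z_{pq} \rangle - \langle w_q, z_{pq} \rangle \ge \alpha\}$
9:   if $\mathcal{E}^\alpha(z_{pq}, q) \neq \emptyset$ then
10:    compute some ultraconservative update steps $\tau_1, \ldots, \tau_Q$ such that:
       $\begin{cases} \tau_q = 1 \\ \tau_r \le 0, & \forall r \in \mathcal{E}^\alpha(z_{pq}, q) \\ \tau_r = 0, & \text{otherwise} \end{cases} \quad \text{and} \quad \sum_{r=1}^{Q} \tau_r = 0$
11:    perform the updates for $r = 1, \ldots, Q$:
       $w_r \leftarrow w_r + \tau_r z_{pq}$
12:  end if
13: until $\|z_{pq}\|$ is too small
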


the computed (line 4 through line 7) $z_{pq}$ vectors instead of the $x_i$'s. Establishing
that under some conditions UMA stops and provides a classifier with small risk when
those update points are used is the purpose of the following subsections; we will
also discuss the unspecified step 3, dealing with the selection step.
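
The computation of the update vector (steps 4 through 7 of UMA) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation: the helper names `solve` and `update_vector`, the 0-based class indices and the small Gauss-Jordan solver are all hypothetical choices. Instead of forming $C^{-1}$ explicitly, the sketch solves $C^\top u = e_q$, so that $u$ is the $q$th row of $C^{-1}$ and the update vector is $\sum_k u_k \gamma_k^p$.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A u = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def update_vector(X: Sequence[Sequence[float]], noisy_y: Sequence[int],
                  W: Sequence[Sequence[float]], C: Sequence[Sequence[float]],
                  p: int, q: int, alpha: float = 0.0) -> List[float]:
    """Return the q-th row of C^{-1} Gamma^p, i.e. the update vector z_pq."""
    n, d, Q = len(X), len(X[0]), len(W)
    gamma = [[0.0] * d for _ in range(Q)]
    for x, y in zip(X, noisy_y):
        score_p = dot(W[p], x)
        # Step 4: keep points predicted as class p with margin at least alpha.
        if all(score_p - dot(W[k], x) >= alpha for k in range(Q) if k != p):
            # Step 5: accumulate the point into the row of its noisy label.
            gamma[y] = [g + xi / n for g, xi in zip(gamma[y], x)]
    # Steps 6-7: row q of C^{-1} is the solution u of C^T u = e_q.
    C_T = [[C[i][j] for i in range(Q)] for j in range(Q)]
    e_q = [1.0 if k == q else 0.0 for k in range(Q)]
    u = solve(C_T, e_q)
    return [sum(u[k] * gamma[k][j] for k in range(Q)) for j in range(d)]
```

When the noisy label counts match their expectations exactly, the returned vector coincides with the average of the selected points of true class $q$ (scaled by $1/n$), which is the behaviour the subsequent propositions establish in expectation.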

For the impatient reader, we may already reveal some of the ingredients we use
to prove the relevance of our procedure. Theorem 1, which shows the convergence
of ultraconservative algorithms, rests on the analysis of the updates made when
training examples are misclassified by the current classifier. The conveyed
message is therefore that examples that are erred upon are central to the convergence
analysis. It turns out that steps 4 through 7 of UMA (cf. Algorithm 2) construct a
point $z_{pq}$ that is, with high probability, mistaken on. More precisely, the true class
$t(z_{pq})$ of $z_{pq}$ is $q$ and it is predicted to be of class $p$ by the current classifier; at the
same time, these update vectors are guaranteed to realize a positive margin
condition with respect to $W^*$: $\langle w_q^*, z_{pq} \rangle > \langle w_k^*, z_{pq} \rangle$ for all $k \neq q$. The ultraconservative
feature of the algorithm is carried by step 8 and step 10, which make it possible to
update any prototype vector $w_r$ with $r \neq q$ having an inner product $\langle w_r, z_{pq} \rangle$ with
$z_{pq}$ larger than $\langle w_q, z_{pq} \rangle$ (which should be the largest if a correct prediction were
made). The reason why we have results 'with high probability' is that the $z_{pq}$'s
are sample-based estimates of update vectors known to be of class $q$ but predicted
as being of class $p$, with $p \neq q$; computing the accuracy of the sample estimates is
one of the important exercises of what follows. A control on the accuracy makes
it possible for us to then establish the convergence of the proposed algorithm.


3.3 With High Probability, $z_{pq}$ is a Mistake with Positive Margin


Here, we prove that the update vector $z_{pq}$ given in step 7 is, with high probability,
a point on which the current classifier errs.

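Proposition 1 Let $W = [w_1 \cdots w_Q] \in \mathbb{R}^{d \times Q}$ and $\alpha > 0$ be fixed. Let $A_p^\alpha$ be defined
as in step 4 of Algorithm 2, i.e.:

$A_p^\alpha = \{x \mid x \in S, \langle w_p, x \rangle - \langle w_k, x \rangle \ge \alpha, \forall k \neq p\}$. (8)

For $k \in \mathcal{Q}$, $p \neq k$, consider the random variable $\gamma_k^p$ ($\gamma_k^p$ in step 5 of Algorithm 2 is a
realization of this variable, hence the overloading of notation $\gamma_k^p$):

$\gamma_k^p = \frac{1}{n} \sum_{i} \mathbb{I}\{Y_i = k\}\mathbb{I}\{X_i \in A_p^\alpha\} X_i^\top$.

The following holds, for all $k \in \mathcal{Q}$:

$\mathbb{E}_S\{\gamma_k^p\} = \mathbb{E}_{\{(X_i, Y_i)\}_{i=1}^n}\{\gamma_k^p\} = \sum_{q=1}^{Q} C_{kq}\mu_q^p$, (9)

where

$\mu_q^p = \mathbb{E}_{\mathcal{D}_X}\{\mathbb{I}\{t(X) = q\}\mathbb{I}\{X \in A_p^\alpha\} X^\top\}$. (10)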

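Proof Let us compute $\mathbb{E}_{\mathcal{D}_{XY}}\{\mathbb{I}\{Y = k\}\mathbb{I}\{X \in A_p^\alpha\} X^\top\}$:

$\mathbb{E}_{\mathcal{D}_{XY}}\{\mathbb{I}\{Y = k\}\mathbb{I}\{X \in A_p^\alpha\} X^\top\}$
$= \int_{\mathcal{X}} \sum_{q=1}^{Q} \mathbb{I}\{q = k\}\mathbb{I}\{x \in A_p^\alpha\} x^\top \mathbb{P}_Y(Y = q \mid X = x)\, d\mathcal{D}_X(x)$
$= \int_{\mathcal{X}} \mathbb{I}\{x \in A_p^\alpha\} x^\top \mathbb{P}_Y(Y = k \mid X = x)\, d\mathcal{D}_X(x)$
$= \int_{\mathcal{X}} \mathbb{I}\{x \in A_p^\alpha\} x^\top C_{k\,t(x)}\, d\mathcal{D}_X(x)$ (cf. (4))
$= \int_{\mathcal{X}} \sum_{q=1}^{Q} \mathbb{I}\{t(x) = q\}\mathbb{I}\{x \in A_p^\alpha\} x^\top C_{kq}\, d\mathcal{D}_X(x)$
$= \sum_{q=1}^{Q} C_{kq} \int_{\mathcal{X}} \mathbb{I}\{t(x) = q\}\mathbb{I}\{x \in A_p^\alpha\} x^\top\, d\mathcal{D}_X(x) = \sum_{q=1}^{Q} C_{kq}\mu_q^p$,

where the last line comes from the fact that the classes are non-overlapping. The $n$
pairs $(X_i, Y_i)$ being identically and independently distributed gives the result. □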

Intuitively, $\mu_q^p$ must be seen as an example of class $q$ which is erroneously
predicted as being of class $p$. Such an example is precisely what we are looking for
to update the current classifier; as expectations cannot be computed, the estimate
$z_{pq}$ of $\mu_q^p$ is used instead of $\mu_q^p$.

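Proposition 2 Let $W = [w_1 \cdots w_Q] \in \mathbb{R}^{d \times Q}$ and $\alpha > 0$ be fixed. For $p, q \in \mathcal{Q}$, $p \neq q$,
$z_{pq} \in \mathbb{R}^d$ is such that

$\mathbb{E}_{\mathcal{D}_{XY}}\{z_{pq}\} = \mu_q^p$, (11)

$\langle w_q^*, \mu_q^p \rangle - \langle w_k^*, \mu_q^p \rangle \ge \theta, \forall k \neq q$, (12)

$\langle w_p, \mu_q^p \rangle - \langle w_k, \mu_q^p \rangle \ge \alpha, \forall k \neq p$. (13)

(Normally, we should consider the transpose of $\mu_q^p$, but since we deal with vectors of
$\mathbb{R}^d$, and not matrices, we abuse the notation and omit the transpose.)

This means that

i) $t(\mu_q^p) = q$, i.e. the 'true' class of $\mu_q^p$ is $q$;

ii) and $f_W(\mu_q^p) = p$; $\mu_q^p$ is therefore misclassified by the current classifier $f_W$.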

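Proof According to Proposition 1,

$\mathbb{E}_{\mathcal{D}_{XY}}\{\Gamma^p\} = \begin{bmatrix} \mathbb{E}\{\gamma_1^p\} \\ \vdots \\ \mathbb{E}\{\gamma_Q^p\} \end{bmatrix} = \begin{bmatrix} \sum_{q=1}^{Q} C_{1q}\mu_q^p \\ \vdots \\ \sum_{q=1}^{Q} C_{Qq}\mu_q^p \end{bmatrix} = C \begin{bmatrix} \mu_1^p \\ \vdots \\ \mu_Q^p \end{bmatrix}$.

Hence, inverting $C$ and extracting the $q$th row of the resulting matrix equality gives
that $\mathbb{E}\{z_{pq}\} = \mu_q^p$.

Equation (12) is obtained thanks to Assumption 1 combined with (10) and the
linearity of the expectation. Equation (13) is obtained thanks to the definition (8)
of $A_p^\alpha$ (made of points that are predicted to be of class $p$) and the linearity of the
expectation. □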


The attentive reader may notice that Proposition 2 or, equivalently, step 7, is 
precisely the reason for requiring C to be invertible, as the computation of Zpq 
hinges on the resolution of a system of equations based on C. 


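Proposition 3 Let $\varepsilon > 0$ and $\delta \in (0; 1]$. There exists a number

$n_0(\varepsilon, \delta, d, Q) = O\left(\frac{1}{\varepsilon^2}\left[\ln\frac{1}{\delta} + \ln Q + d \ln\frac{1}{\varepsilon}\right]\right)$

such that if the number of training samples is greater than $n_0$ then, with high probability

$\langle w_q^*, z_{pq} \rangle - \langle w_k^*, z_{pq} \rangle \ge \theta - \varepsilon, \forall k \neq q$, (14)

$\langle w_p, z_{pq} \rangle - \langle w_k, z_{pq} \rangle \ge 0, \forall k \neq p$. (15)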

Proof The existence of $n_0$ relies on pseudo-dimension arguments. We defer this
part of the proof to Appendix A and we will directly assume here that if $n \ge n_0$,
then, with probability $1 - \delta$, for any $W$ and $z_{pq}$,

$|\langle w_p - w_q, z_{pq} \rangle - \langle w_p - w_q, \mu_q^p \rangle| \le \varepsilon$. (16)

Proving (14) then proceeds by observing that

$\langle w_q^* - w_k^*, z_{pq} \rangle = \langle w_q^* - w_k^*, \mu_q^p \rangle + \langle w_q^* - w_k^*, z_{pq} - \mu_q^p \rangle$,

bounding the first part using Proposition 2:

$\langle w_q^* - w_k^*, \mu_q^p \rangle \ge \theta$

and the second one with (16). A similar reasoning allows us to get (15) by setting
$\alpha = \varepsilon$ in $A_p^\alpha$. □

This last proposition essentially says that the update vectors $z_{pq}$ that we compute
are, with high probability, erred upon and realize a margin condition $\theta - \varepsilon$.

Note that $\alpha$ is needed to cope with the imprecision incurred by the use of
empirical estimates. Indeed, we can only approximate $\langle w_p, z_{pq} \rangle - \langle w_k, z_{pq} \rangle$ in (15)
up to a precision of $\varepsilon$. Thus for the result to hold we need to have $\langle w_p, \mu_q^p \rangle -
\langle w_k, \mu_q^p \rangle \ge \varepsilon$, which is obtained from (13) when $\alpha = \varepsilon$. In practice, this just says
that the points used in the computation of $z_{pq}$ are at a distance at least $\alpha$ from
any decision boundary.
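
The role of $\alpha$ in (15) can be spelled out as a short derivation. It is only a sketch: it assumes that the deviation bound (16) also holds for the pair $(w_p, w_k)$, in the same way the proof of (14) applies it to $W^*$.

```latex
\begin{align*}
\langle w_p - w_k, z_{pq} \rangle
  &= \langle w_p - w_k, \mu_q^p \rangle
     + \langle w_p - w_k, z_{pq} - \mu_q^p \rangle \\
  &\ge \alpha - \varepsilon
     && \text{by (13) for the first term and (16) for the second} \\
  &= 0
     && \text{when } \alpha = \varepsilon .
\end{align*}
```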

Remark 2 It is important to understand that the parameter $\alpha$ helps us derive
sample complexity results by allowing us to retrieve a linearly separable training
dataset with positive margin from the noisy dataset. The theoretical results we
prove hold for any such parameter $\alpha > 0$, and the smaller this parameter, the larger
the sample complexity, i.e., the harder it is for the algorithm to take advantage
of a training sample that meets the sample complexity requirements. In other
words, the smaller $\alpha$, the less likely it is for UMA to succeed; yet, as shown in the
experiments, where we use $\alpha = 0$, UMA continues to perform quite well.


3.4 Convergence and Stopping Criterion 

We arrive at our main result, which provides both convergence and a stopping 
criterion. 

Proposition 4 Under Assumptions 1, 2 and 3 there exists a number $n$, polynomial
in $d$, $1/\theta$, $Q$, $1/\delta$, such that if the training sample is of size at least $n$, then, with high
probability ($1 - \delta$), UMA makes at most $O(1/\theta^2)$ updates.

Proof Let $S_z$ be the set of all the update vectors $z_{pq}$ generated during the execution
of UMA and labeled with their true class $q$. Observe that, in this context, UMA (Alg.
2) behaves like a regular ultraconservative algorithm run on $S_z$. Namely: a) lines
4 through 7 compute a new point in $S_z$, and b) lines 8 through 10 perform an
ultraconservative update step.

From Proposition 3, we know that with high probability, $f_{W^*}$ is a classifier with
positive margin $\theta - \varepsilon$ on $S_z$, and it follows from Theorem 1 that UMA does not make
more than $O(1/\theta^2)$ mistakes on such a dataset.

Because, by construction, we have that with high probability each element of
$S_z$ is erred upon, $|S_z| \in O(1/\theta^2)$; this means that, with high probability, UMA
does not make more than $O(1/\theta^2)$ updates.

All in all, after $O(1/\theta^2)$ updates, there is a high probability that we are not
able to construct examples on which UMA makes a mistake or, equivalently, the
conditional misclassification errors $\mathbb{P}(f_W(X) = p \mid Y = q)$ are all small. □

Even though UMA operates in a batch setting, it 'internally' simulates the
execution of an online algorithm that encounters a new training point ($z_{pq} \in S_z$) at
each time step. To see more precisely how UMA can be viewed as an online algorithm,
it suffices to imagine it being run in a way where each vector update is made after
a chunk of $n$ (where $n$ is as in Proposition 4) training data has been encountered
and used to compute the next element of $S_z$. Repeating this process $O(1/\theta^2)$ times
then guarantees convergence with high probability. Note that, in this scenario, UMA
requires $n' = O(n/\theta^2)$ data to converge, which might be far more than the sample
complexity exhibited in Proposition 4. Nonetheless, $n'$ still remains polynomial in
$d$, $1/\theta$, $Q$ and $1/\delta$. For more detail on this (online to batch conversion) approach,
we refer the interested readers to [6].


3.5 Selecting p and q 

So far, the question of selecting good pairs of values $p$ and $q$ to perform updates has
been left unanswered. Indeed, our results hold for any pair $(p, q)$ and convergence
is guaranteed even when $p$ and $q$ are arbitrarily selected as long as $z_{pq}$ is not 0.
Nonetheless, it is reasonable to use heuristics for selecting $p$ and $q$ with the hope
that it might improve the practical convergence speed.

On the one hand, we may focus on the pairs $(p, q)$ for which the empirical
misclassification rate

$\hat{\mathbb{P}}_{X \sim S}\{f_W(X) \neq t(X)\} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{f_W(x_i) \neq t(x_i)\}$ (17)

is the highest ($X \sim S$ means that $X$ is randomly drawn from the uniform
distribution of law $x \mapsto \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\{x = x_i\}$ defined with respect to training set
$S = \{(x_i, y_i)\}_{i=1}^n$). We want to favor those pairs $(p, q)$ because i) the induced
update may lead to a greater reduction of the error and ii) more importantly,
$z_{pq}$ may be more reliable, as $A_p^\alpha$ will be bigger.

On the other hand, recent advances in the passive-aggressive literature [24]
have emphasized the importance of minimizing the empirical confusion rate, given
for a pair $(p, q)$ by the quantity

$\hat{\mathbb{P}}_{X \sim S}\{f_W(X) = p \mid t(X) = q\} = \frac{1}{n_q} \sum_{i=1}^{n} \mathbb{I}\{t(x_i) = q, f_W(x_i) = p\}$, (18)

where

$n_q = \sum_{i=1}^{n} \mathbb{I}\{t(x_i) = q\}$.

This approach is especially valuable when dealing with imbalanced classes, and one
might want to optimize the selection of $(p, q)$ with respect to the confusion rate.



Unconfused Ultraconservative Multiclass Algorithms 




Obviously, since the true labels in the training data cannot be accessed, neither
of the quantities defined in (17) and (18) can be computed. Using a result provided
in [6], which states that the norm of an update vector computed as $z_{pq}$ directly
provides an estimate of (17), we devise two possible strategies for selecting $(p, q)$:

$(p, q)_{\text{error}} = \operatorname{argmax}_{(p,q)} \|z_{pq}\|$ (19)

$(p, q)_{\text{conf}} = \operatorname{argmax}_{(p,q)} \frac{\|z_{pq}\|}{\hat{\pi}_q}$ (20)

where $\hat{\pi}_q$ is the estimated proportion of examples of true class $q$ in the training
sample. In a way similar to the computation of $z_{pq}$ in Algorithm 2, $\hat{\pi}_q$ may be
estimated as follows:

$\hat{\pi}_q = \frac{1}{n} [C^{-1} y]_q$,

where $y \in \mathbb{R}^Q$ is the vector containing the number of examples from $S$ having
noisy labels $1, \ldots, Q$, respectively.
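
The two selection rules (19) and (20) can be sketched as below. The names `solve`, `estimate_class_proportions` and `select_pair`, the 0-based class indices and the dictionary of precomputed norms are assumptions made for the illustration; the class proportions are obtained by solving the linear system with matrix $C$ rather than by forming $C^{-1}$ explicitly.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def solve(A: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve A u = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def estimate_class_proportions(noisy_labels: Sequence[int],
                               C: Sequence[Sequence[float]]) -> List[float]:
    """Return the estimates (1/n) [C^{-1} y]_q for every class q."""
    counts = [0.0] * len(C)
    for y in noisy_labels:
        counts[y] += 1.0
    return [v / len(noisy_labels) for v in solve(C, counts)]


def select_pair(z_norms: Dict[Tuple[int, int], float],
                proportions: Optional[Sequence[float]] = None) -> Tuple[int, int]:
    """Rule (19) if `proportions` is None, rule (20) otherwise."""
    def score(pair: Tuple[int, int]) -> float:
        _, q = pair
        return z_norms[pair] if proportions is None else z_norms[pair] / proportions[q]
    return max(z_norms, key=score)
```

With finite samples an estimated proportion may be very small or even non-positive, so a practical implementation would probably need to guard the division in rule (20).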

The second selection criterion is intended to normalize the number of errors
with respect to the proportions of different classes and aims at being robust to
imbalanced data. Our goal here is to provide a way to take into account the class
distribution for the selection of $(p, q)$. Note that this might be a first step towards
transforming UMA into an algorithm for minimizing the confusion risk, even though
additional (and significant) work is required to provably provide UMA with this
feature.

On a final note, we remark that $(p, q)_{\text{conf}}$ requires additional precautions when
used: when $(p, q)_{\text{error}}$ is implemented, $z_{pq}$ is guaranteed to be the update vector of
maximum norm among all possible update vectors, whereas this no longer holds
true when $(p, q)_{\text{conf}}$ is used, and if $z_{pq}$ is close to 0 then there may exist another
update vector $z_{p'q'}$, for some $(p', q') \neq (p, q)$, that is possibly more informative
from the standpoint of convergence speed.


3.6 UMA and Kernels 

Thus far, we have only considered the situation where linear classifiers are learned.
There are however many learning problems that cannot be handled effectively
without going beyond linear classification. A popular strategy to deal with such
a situation is to make use of kernels [26]. In this direction, there are
(at least) two paths that can be taken. The first one is to revisit UMA and provide
a kernelized algorithm based on a dual representation of the weight vectors, as
is done with the kernel Perceptron (see [11]) or its close cousins (see, e.g. [18,
13,17]). Doing so would raise the question of finding sparse expansions of the
weight vectors with respect to the training data in order to contain the prediction
time and to derive generalization guarantees based on such sparsity: this is an
interesting and ambitious research program on its own. A second strategy, which
we make use of in the numerical simulations, is simply to build upon the idea of
Kernel Projection Machines [4,28]: first, perform a Kernel Principal Component
Analysis (abbreviated as kernel-PCA hereafter) with D principal axes; second,
project the data onto the principal D-dimensional subspace; and, finally, run UMA on
the obtained data. The availability of numerous methods to efficiently extract the
principal subspaces (or approximations thereof) [1,15,16,27,30] makes this path
a viable strategy to render UMA usable for nonlinearly separable concepts. This
explains why we decided to use this strategy in the present paper.


4 Experiments 


In this section, we present results from numerical simulations of our approach
and we discuss different practical aspects of UMA. The ultraconservative step sizes
retained are those corresponding to a regular Perceptron: $\tau_p = -1$ and $\tau_q = +1$,
the other values of $\tau_r$ being equal to 0.
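
To make the role of these step sizes concrete, here is a minimal sketch of a single additive update. It assumes the update has the form w_r <- w_r + tau_r * z for every class r, with z the update vector computed for the selected pair (p, q); the function name and the list-of-lists representation of the weight vectors are hypothetical.

```python
def additive_update(weights, z, p, q, tau_p=-1.0, tau_q=1.0):
    """Apply w_r <- w_r + tau_r * z in place.

    Only classes p and q are touched, which amounts to tau_r = 0
    for every other class r.
    """
    for i, z_i in enumerate(z):
        weights[p][i] += tau_p * z_i
        weights[q][i] += tau_q * z_i
    return weights
```

With the default values, the weight vector of class q is moved towards z and that of class p away from it, as in a regular multiclass Perceptron.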

Section 4.1 discusses robustness results, based on simulations conducted on 
synthetic data while Section 4.2 takes it a step further and evaluates our algorithm 
on real data, with a realistic noise process related to Example 1 (cf. Section 1). 

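We essentially use what we call the confusion rate as a performance measure,
which is:

\[ \frac{1}{\sqrt{Q}} \|C\|_F, \]

where $\|C\|_F$ is the Frobenius norm of the confusion matrix C computed on a test
set $S_{\text{test}}$ (independent from the training set), i.e.:

\[ \|C\|_F^2 = \sum_{p,q} c_{pq}^2, \quad \text{with } c_{pq} = \begin{cases} 0 & \text{if } p = q, \\ \dfrac{\sum_{x_i \in S_{\text{test}}} \mathbb{1}\{\hat{y}_i = p,\ t(x_i) = q\}}{\sum_{x_i \in S_{\text{test}}} \mathbb{1}\{t(x_i) = q\}} & \text{otherwise,} \end{cases} \]

with $\hat{y}_i$ the label predicted for the test instance $x_i$ by the learned predictor. C is
much akin to a recall matrix, and the $1/\sqrt{Q}$ factor ensures that the confusion rate
lies between 0 and 1.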
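
The performance measure can be sketched as follows. This is an illustration under assumptions: each off-diagonal entry is the fraction of test examples of true class q that are predicted as p (a recall-style normalization, which is what keeps the value between 0 and 1), classes are 0-indexed, and the function name is hypothetical.

```python
import math

def confusion_rate(true_labels, predicted_labels, num_classes):
    """Frobenius norm of the diagonal-free confusion matrix, divided by sqrt(Q)."""
    counts = [0] * num_classes  # number of test examples per true class
    confused = [[0] * num_classes for _ in range(num_classes)]
    for t, y in zip(true_labels, predicted_labels):
        counts[t] += 1
        if y != t:
            confused[y][t] += 1  # predicted y, true class t
    squared = 0.0
    for p in range(num_classes):
        for q in range(num_classes):
            if p != q and counts[q] > 0:
                squared += (confused[p][q] / counts[q]) ** 2
    return math.sqrt(squared) / math.sqrt(num_classes)
```

A perfect predictor gets a confusion rate of 0, and a two-class predictor that swaps both classes gets 1.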


4.1 Toy dataset 

We use a 10-class dataset with a total of roughly 1,000 2-dimensional examples
distributed according to $\mathcal{U}$, the uniform distribution over the
unit circle centered at the origin. Labelling is achieved according to (1) given a set
of 10 weight vectors $w_1, \ldots, w_{10}$, which are also randomly generated according to
$\mathcal{U}$; all these weight vectors therefore have norm 1. A margin $\theta = 0.025$ is enforced
in the generated data by removing examples that are too close to the decision
boundaries. In practice, with this value of $\theta$, the case where three classes are so
close to each other that no training example from one of the classes remained after
enforcing the margin never occurred.
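
The generation procedure can be sketched as below. The sketch assumes that (1) is the usual rule assigning x to the class whose weight vector gives the largest inner product, and that the margin is measured as the gap between the best and the second-best scores; the function names and the seed are hypothetical.

```python
import math
import random

def unit_circle_point(rng):
    """Draw a point uniformly on the unit circle."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (math.cos(angle), math.sin(angle))

def make_toy_dataset(n_points, n_classes=10, margin=0.025, seed=0):
    """Return (weights, points, labels), keeping only points with enough margin."""
    rng = random.Random(seed)
    weights = [unit_circle_point(rng) for _ in range(n_classes)]
    points, labels = [], []
    for _ in range(n_points):
        x = unit_circle_point(rng)
        scores = sorted(
            ((w[0] * x[0] + w[1] * x[1], k) for k, w in enumerate(weights)),
            reverse=True,
        )
        (best, label), (second, _) = scores[0], scores[1]
        # Discard examples that lie too close to a decision boundary.
        if best - second >= margin:
            points.append(x)
            labels.append(label)
    return weights, points, labels
```

Because some points are rejected, the returned sample is slightly smaller than n_points, which matches the "roughly 1,000" examples mentioned above.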

The learned classifiers are tested against a dataset of 10,000 points that are
distributed according to the training distribution. The results reported in the
tables and graphics are averaged over 10 runs.

The noise is generated from the sole confusion matrix. This situation can be
tough to handle and is rarely met with real data but we stick with it as it is a
good example of a worst-case scenario.

Robustness to noise. We first (Fig. 1(a)) evaluate the robustness to noise of UMA
by running our algorithm with various confusion matrices. We uniformly draw a
reference nonnegative square matrix M; the rows of M are then normalized, i.e.
each entry of M is divided by the sum of the elements of its row, so that M is a stochastic
matrix. If M is not invertible, it is rejected and we draw a new matrix until we
have an invertible one. Then, we define N such that $N = (M - I)/10$, where I is
the identity matrix of order Q; typically N has nonpositive diagonal entries and
nonnegative off-diagonal coefficients. We use N to parametrize a family of
confusion matrices whose dominant coefficients move from the
diagonal to the off-diagonal part. Namely, we run UMA 20 times with confusion
matrices $C \in \{C_i = \Omega(I + iN)\}_{i=1}^{20}$, where $\Omega$ is a matrix operator which outputs
a (row-)stochastic matrix: when applied to a matrix A, $\Omega$ replaces the negative
elements of A by zeros and normalizes the rows of the obtained matrix; note
that i = 10 corresponds to the case where C = M. Equivalently, one can think of
$C_i$ as the weighted average between I and $\Omega(N)$ where I has a constant weight
of 1 and $\Omega(N)$ is weighted by i. Note that, after some point, further increasing
i has little effect on $C_i$ as it eventually converges to $\Omega(N)$. Figure 1(a) plots
our results against the Frobenius norm of the diagonal-free confusion matrix C,
that is: $\|C - \mathrm{diag}(C)\|_F$, where diag(C) denotes the diagonal matrix with the same
diagonal values as C. For the sake of comparison, we have also run UMA with a fixed
confusion matrix C = I on the same data. This amounts to running a Perceptron
through the data multiple times and it allows us to have a baseline for measuring
the improvement induced by the use of the confusion matrix.
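
The construction of the family of confusion matrices can be sketched as follows. This is only an illustration: the function names are hypothetical, matrices are plain lists of rows, and the invertibility test on M described above is left out for brevity.

```python
import random

def row_normalize(matrix):
    """Divide each entry by the sum of its row."""
    return [[v / sum(row) for v in row] for row in matrix]

def omega(matrix):
    """Replace negative entries by zeros, then normalize the rows."""
    clipped = [[max(v, 0.0) for v in row] for row in matrix]
    return row_normalize(clipped)

def random_stochastic_matrix(q, rng):
    """Draw a nonnegative q x q matrix uniformly and normalize its rows."""
    return row_normalize([[rng.random() for _ in range(q)] for _ in range(q)])

def confusion_family(m, steps=20):
    """Return [C_1, ..., C_steps] with C_i = omega(I + i * N), N = (M - I) / 10."""
    q = len(m)
    identity = [[1.0 if r == c else 0.0 for c in range(q)] for r in range(q)]
    n = [[(m[r][c] - identity[r][c]) / 10.0 for c in range(q)] for r in range(q)]
    family = []
    for i in range(1, steps + 1):
        shifted = [[identity[r][c] + i * n[r][c] for c in range(q)] for r in range(q)]
        family.append(omega(shifted))
    return family
```

For i = 10 the matrix I + iN equals M, so the tenth element of the family recovers the reference matrix, in line with the remark above.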

Robustness to the incorrect estimation of the confusion matrix. The second experiment
(Fig. 1(b)) evaluates the robustness of UMA to the use of a confusion matrix
that is not exactly the confusion matrix that describes the noise process corrupting
the data; this will allow us to measure the extent to which a confusion matrix
(inaccurately) estimated from the training data can be dealt with by UMA. Using
the same notation as before, and the same idea of generating a random stochastic
reference matrix M, we proceed as follows: we use the given matrix M to corrupt
the noise-free dataset and then each confusion matrix from the family $\{C_i\}_{i=1}^{20}$
is fed to UMA as if it were the confusion matrix governing the noise process. We
introduce the notion of approximation factor $\rho$ as $\rho(i) = 1 - i/10$, so that $\rho$ takes
values in the set {-1, -0.9, ..., 0.9}. As a reference, the limit case where $\rho = 1$ (that
is, i = 0) corresponds to the case where UMA is fed with the identity matrix I,
effectively being oblivious of any noise in the training set. More generally, the values
of C are shifted away from the diagonal as $\rho$ decreases, the equilibrium point
being $\rho = 0$, where C is equal to the true confusion matrix M. Consequently, a
positive (resp. negative) approximation factor means that the noise is underestimated
(resp. overestimated), in the sense that the noise process described by C would
corrupt a lower (resp. higher) fraction of labels from each class than the true noise
process applied to the training set, which corresponds to M. Figure 1(b) plots the
confusion rate against this approximation factor.
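
The two ingredients of this experiment, corrupting the clean labels with the reference matrix and indexing the matrices fed to UMA by the approximation factor, can be sketched as follows. The sketch assumes a row-stochastic convention in which confusion[p][q] is the probability that true class p is observed as q; the function names and the seed are hypothetical.

```python
import random

def approximation_factor(i):
    """Approximation factor 1 - i/10 associated with the i-th matrix C_i."""
    return 1.0 - i / 10.0

def corrupt_labels(true_labels, confusion, seed=0):
    """Draw one noisy label per example from the row of its true class."""
    rng = random.Random(seed)
    classes = list(range(len(confusion)))
    return [rng.choices(classes, weights=confusion[t], k=1)[0] for t in true_labels]
```

In this setting the labels are always corrupted with M, while the index i only changes which matrix is given to the learner; i = 10 gives an approximation factor of 0, i.e. the true confusion matrix.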

On Figure 1(a) we observe that UMA clearly improves over the
Perceptron algorithm for every noise level tested, as it achieves lower confusion
rates. Nonetheless, its performance degrades as the noise level increases, going
from a confusion rate of 0.5 for small noise levels (that is, when $\|C - \mathrm{diag}(C)\|_F$
is small) to roughly 2.25 when the noise is the strongest. Comparatively, the
Perceptron algorithm follows the same trend, but with higher confusion rates, ranging
from 1.7 to 2.75.

Fig. 1: (a) Robustness to noise: evolution of the confusion rate (y-axis) for different
noise levels (x-axis); (b) Robustness to noise estimation: evolution of the same quantity
with respect to errors in the confusion matrix C (x-axis), measured by the
approximation factor (see text).

The second simulation (Fig. 1(b)) points out that, in addition to being robust to
the noise process itself, UMA is also robust to underestimated (approximation factor
$\rho > 0$) noise levels, but not to overestimated (approximation factor $\rho < 0$) noise
levels. Unsurprisingly, the best confusion rate corresponds to an approximation
factor of 0, which means that UMA is using the true confusion matrix and can
achieve a confusion rate as low as 1.8. There is a clear gap between positive and
negative approximation factors, the former yielding confusion rates around 2.6
while the latter's are slightly lower, around 2.15. From these observations, it is
clear that the approximation factor has a major influence on the performance of
UMA.


4.2 Real data 

4.2.1 Experimental Protocol

In addition to the results on synthetic data, we also perform simulations in a
realistic learning scenario. In this section we assume that labelling
examples is very expensive and we implement the strategy evoked in Example 1.
More precisely, for a given dataset S, we proceed as follows:

1. Ask for a small number m of examples for each of the Q classes.

2. Learn a rough classifier^ g from these Q × m points.

3. Estimate the confusion matrix C of g on a small labelled subset $S_{\text{conf}}$ of S.

4. Predict the missing labels y of S using g; thus, y is a sequence of noisy labels.

5. Learn the final classifier $f_{\text{UMA}}$ from S, y, C and measure its error rate.
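
Step 3 of the protocol can be illustrated with a minimal sketch. It assumes a row-stochastic convention (entry [p][q] is the fraction of examples of true class p that the rough classifier labels as q) and 0-indexed classes; the function name is hypothetical. It also fails explicitly when a class is absent from the labelled subset, since the corresponding row cannot be estimated.

```python
def estimate_confusion(true_labels, predicted_labels, num_classes):
    """Estimate a row-stochastic confusion matrix from a labelled subset."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, y in zip(true_labels, predicted_labels):
        counts[t][y] += 1
    confusion = []
    for p, row in enumerate(counts):
        total = sum(row)
        if total == 0:
            raise ValueError(f"class {p} is absent from the labelled subset")
        confusion.append([c / total for c in row])
    return confusion
```

The estimated matrix is then used together with the noisy labels of step 4 to run the final learning step.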


^ For the sake of self-containedness, we use UMA for this task (with C being the identity
matrix). Recall that, when used this way, UMA acts as a regular Perceptron algorithm.

One might wonder why we do not simply sample a very small portion of S in
the first step. The reason is that, in the case of very uneven class proportions,
some of the classes may be missing from this first sample. This is problematic
when estimating C as it leads to a non-invertible confusion matrix. Moreover, the
purpose of g is only to provide a baseline for the computation of y, hence tweaking
the class (im)balance in this step is not a problem.

In order to put our results into perspective, we compare them with results
obtained from various algorithms. This allows us to give a precise idea of the
benefits and limitations of UMA. Namely, we learn four additional classifiers: $f_y$
is a regular Perceptron learned on S labelled with the noisy labels y, $f_{\text{conf}}$ and $f_{\text{full}}$
are trained with the correctly labelled training sets $S_{\text{conf}}$ and S respectively and,
lastly, $f_{\text{S3VM}}$ is a classifier produced by a multiclass semi-supervised SVM algorithm
(S3VM, [3]) run on S where only the labels of $S_{\text{conf}}$ are provided. The performances
achieved by $f_y$ and $f_{\text{full}}$ provide bounds for UMA's error rates: on the one hand, $f_y$
corresponds to a worst-case situation, as we simply ignore the confusion matrix
and use the regular Perceptron instead (arguably, UMA should perform better than
this); on the other hand, $f_{\text{full}}$ represents the best-case scenario for learning, when
all the correct labels are available (the performance of $f_{\text{full}}$ should always top that
of UMA and of the other classifiers). The last two classifiers, $f_{\text{conf}}$ and
$f_{\text{S3VM}}$, provide us with objective comparison measures. They are learned from the
same data as UMA but use them differently: $f_{\text{conf}}$ is learned from the reduced training
set $S_{\text{conf}}$, and $f_{\text{S3VM}}$ is output by a semi-supervised learning strategy that infers both
$f_{\text{S3VM}}$ and the missing labels of S and totally ignores the predictions y made by
g. Note that, according to the learning scenario we implement, we assume C to
be estimated from raw data. This might not always be the case with real-world
problems, and C might be easier and/or less expensive to get than raw data; for
instance, it might be deduced from expert knowledge on the studied domain. In
that case, $f_{\text{conf}}$ and $f_{\text{S3VM}}$ may suffer from not taking full advantage of the accurate
information about the confusion.

4.2.2 Datasets

Our simulations are conducted on three different datasets, each with different
features. For the sake of reproducibility, we used datasets that can be easily found
on the UCI Machine Learning Repository [2]. Moreover, these datasets correspond
to tasks for which generating a complete, labelled training set is typically costly
because of the necessity of human supervision, and subject to classification noise.
The datasets used and their main features are as follows.

Optical Recognition of Handwritten Digits. This well-known dataset is composed of
8 × 8 grey-level images of handwritten digits, ranging from 0 to 9. The dataset
is composed of 3,823 images of 64 features for training, and 1,797 for the test
phase. We set m to 10 for this dataset, which means that g is learned from 100
examples only. $S_{\text{conf}}$ is a sampling of 5% of S. The classes are evenly distributed
(see Figure 2(a)). We handle the nonlinearity through the use of a Gaussian
kernel-PCA (see Section 3.6) to project the data onto a feature space of dimension 640.

Letter Recognition. The Letter Recognition dataset is another well-known pattern
recognition dataset. The images of the letters are summarized into a vector of
16 attributes, which correspond to various primitives computed on the raw data.
With 20,000 examples, this dataset is much larger than the previous one. As
for the Handwritten Digits dataset, the examples are evenly spread across the 26
classes (see Figure 2(b)). We uniformly select 15,000 examples for training and the
remaining 5,000 are used for test. We set m to 50 as it seems that smaller values
do not yield usable confusion matrices. We again sample 5% of the dataset to
form $S_{\text{conf}}$ and use, as before, a Gaussian kernel-PCA to (nonlinearly)
expand the dimension of the data to 1,600.

Fig. 2: Class distribution for the three datasets: (a) Handwritten Digits,
(b) Letter Recognition, (c) Reuters.

Reuters. The Reuters dataset is a nearly linearly separable document categorization
dataset of more than 300,000 instances of nearly 47,000 features each. For
size reasons we restrict ourselves to roughly 15,000 examples for training, and
15,000 others for test. It turns out that some classes are so underrepresented that
they are flooded by the noise process and/or do not appear in $S_{\text{conf}}$, which may
lead to a non-invertible confusion matrix. We therefore restrict the dataset to the
9 largest classes. One might wonder whether doing so erases class imbalance. This
is not the case as, even this way, the least represented class accounts for roughly
500 examples while this number reaches nearly 4,000 for the most represented
one (see Figure 2(c)). Actually, these 9 classes represent more than 70 percent of
the dataset, reducing the training and test sets to approximately 11,000 examples
each. We do not use any kernel for this dataset, the data being already nearly
linearly separable. Also, we sample $S_{\text{conf}}$ on 5% of the training set and we set
m = 20.


4.2.3 Results

Table 1 presents the misclassification error rates averaged over 10 runs. Keep in mind
that we have not conducted a very thorough optimization of the hyper-parameters,
as the point here is essentially to compare UMA with the other algorithms.
Additionally, we also report the error rates of $f_{\text{S3VM}}$ when trained on the kernelized data
with all dimensions, that is, the kernelized data before we project them onto their
D principal components. Because the projection step is indeed unnecessary with
S3VM, this gives us insights into the error due to the Kernel-PCA step. Comparing
the first and fifth columns of Table 1, it appears that UMA always induces a
slight performance gain, i.e. a decrease of the misclassification rate, with respect
to $f_y$.

Dataset              f_y    f_conf   f_full   f_S3VM   UMA    f_S3VM (no K-PCA)
Handwritten Digits   0.25   0.21     0.04     0.15     0.16   0.07
Letter Recognition   0.35   0.36     0.23     0.49     0.33   0.18
Reuters              0.30   0.17     0.01     0.22     0.21   0.22

Table 1: Misclassification rates of different algorithms.


From the second and third columns of Table 1, it is clear that the reduced
number of examples available to $f_{\text{conf}}$ induces a drastic increase in the
misclassification rate with respect to $f_{\text{full}}$, which is allowed to use the totality of the dataset
during the training phase.

Comparing UMA and $f_{\text{conf}}$ in Table 1 (fifth and second columns), we observe that
UMA achieves lower misclassification rates on the Handwritten Digits and Letter
Recognition datasets but a higher misclassification rate on Reuters. This
is likely related to the strong class imbalance in the dataset. Indeed, some classes
are overly represented, accounting for the vast majority of the whole dataset (see
Fig. 2(c)). Because $S_{\text{conf}}$ is uniformly sampled from the main dataset, $f_{\text{conf}}$ is
trained with a lot of examples from the overrepresented classes and therefore it is
very effective, in the sense that it achieves a low misclassification rate, for these
overrepresented classes; this, in turn, induces a low (global) misclassification rate,
as possibly high misclassification rates on underrepresented classes are offset
by the small portion of the data they account for. On the other hand,
because of this disparity in class representation, the slightest error in the
confusion matrix, provided it involves one of these overrepresented classes, may lead to
a significant increase of the misclassification rate. In this regard, UMA is strongly
disadvantaged with respect to $f_{\text{conf}}$ on the Reuters dataset, and this is the cause of
the reported results.

The error rates for the S3VM and UMA classifiers are close for the Reuters and
Handwritten Digits datasets whereas UMA has a clear advantage on the Letter
Recognition problem. On the other hand, note that we used the S3VM method in
conjunction with a Kernel-PCA for the sake of comparison with UMA in its kernelized
form. The last column of Table 1 tends to confirm that this projection strategy
increases the error rate of $f_{\text{S3VM}}$. Also, recall that the value of m does not impact
the performance of $f_{\text{S3VM}}$ but has a significant effect on UMA, even though UMA never
uses these labelled data. For instance, on the Reuters dataset, increasing m from
20 to 70 reduces UMA's error rate by nearly 0.1 (see the error rates of Fig. 3 (m = 70)
when the size of labelled data is close to 550, that is 5% of the whole dataset).
Despite our efforts to keep m as small as possible, we could not go under m = 50
for the Letter Recognition dataset without compromising the invertibility of the
confusion matrix. The simple fact that an unusually high number of examples is
required to simply learn a rough classifier attests to the complexity of this dataset.
Moreover, the fact that $f_y$ also outperforms $f_{\text{S3VM}}$ implies that the labels fed to UMA
are already mostly correct, and, according to our working assumptions, this is the
most favorable setting for UMA.

Nonetheless, the disparities between UMA and $f_{\text{conf}}$ deserve more attention.
Indeed, the same data are being used by both algorithms, and one could expect
closer results. To get a better insight into what is occurring, we have
reported the evolution of the error rate of these two algorithms with respect to
the sampling size of $S_{\text{conf}}$ in Figure 3. We can see that UMA is unaffected by the
size of the sample, essentially ignoring the possible errors in the confusion matrix
on small samples. This reinforces our previous results showing that UMA is robust
to errors in the confusion matrix. On the other hand, with the addition of more
samples, the refinement of the confusion matrix does not allow UMA to compete
with the value of additional (correctly) labelled data and eventually, when the size
of $S_{\text{conf}}$ grows, $f_{\text{conf}}$ performs better than UMA. This points towards the idea that
the aggregated nature of the confusion matrix incurs some loss of relevant
information for the classification task at hand, and that a more accurate estimate of the
confusion matrix, as induced by, e.g., the use of a larger $S_{\text{conf}}$, may not compensate
for the information provided by additional raw data.

Fig. 3: Error rate of UMA and $f_{\text{conf}}$ with respect to the sampling size (x-axis:
number of labelled data). Reuters dataset with m = 70 for the sake of the figure's readability.

Building on this observation, we go a step further and replicate this experiment
for all of the three datasets; only this time we track the performance of $f_{\text{S3VM}}$
instead. The results are plotted on Figure 4. For the three datasets, we observe
the same behavior as before. Namely, UMA is able to maintain a low error rate even
with a very small $S_{\text{conf}}$. On the other hand, UMA does not benefit as much
as other methods from a large pool of labelled examples. In this case, UMA quickly
stabilizes while, on the contrary, the S3VM method starts at a fairly high error rate
and keeps improving as more labelled examples are available.

Beyond this, it is important to recall that UMA never uses the labels of $S_{\text{conf}}$
(those are only used to estimate the confusion matrix, not the classifier; refer to
Section 4.2.1 for the detailed learning protocol). While refining the estimation of
C is undoubtedly useful, a direction toward substantial performance gains should
revolve around the combination of both this refined estimation of C and the use
of the correctly labelled training set $S_{\text{conf}}$. This is a research subject on its own
that we leave for future work.

Fig. 4: Error rates for the (a) Reuters, (b) optical digit recognition and (c) letter
recognition datasets with respect to the size of $S_{\text{conf}}$ (x-axis: number of labelled
data). Average over 15 runs.

Fig. 5: Error and confusion risk on Reuters dataset with various update strategies:
(a) Error rate, (b) Confusion rate (axis labels: number of updates, norm of the
confusion matrix).


All in all, the reported results lead us to prefer UMA over other available
methods when the amount of labelled data is particularly small, in addition, obviously,
to the motivating case of the present work where the training data are corrupted
and the confusion matrix is known. Another interesting finding is that
even a rough estimation of the confusion matrix is sufficient for UMA to behave well.

Finally, we investigate the impact of the selection strategy of (p, q) on the 
convergence speed of UMA (see Section 3.5). We use three variations of UMA with 
different strategies for selecting (p, q) (error, confusion, and random) and monitor 
each one along the learning process on the Reuters dataset. The error and confusion 
strategies are described in Section 3.5 and the random strategy simply selects p 
and q at random. 

From Figure 5, which reports the misclassification rate and the confusion rate
along the iterations, we observe that both performance measures evolve similarly,
attaining a stable state around the 30th iteration. The best strategy depends on
the performance measure used, even though, regardless of the performance measure
used, we observe that the random selection strategy leads to a predictor that does
not achieve the best performance (there is always a curve beneath that
of the random selection procedure), which shows that it is not an optimal selection
strategy.

As one might expect, the confusion-based strategy performs better than the
error-based strategy when the confusion rate is retained as a performance measure,
while the converse holds when using the error rate. This observation motivates us
to thoroughly study the confusion-based strategy in the near future, as being able to
propose methods robust to class imbalance is a particularly interesting challenge
of multiclass classification.

The plateau reached around the 30th iteration may be puzzling, since the
studied dataset presents no positive margin and convergence is therefore not
guaranteed. One possible explanation for this is to see the Reuters dataset as a linearly
separable problem corrupted by the effect of a noise process, which we call the
intrinsic noise process, that has structural features 'compatible' with the
classification noise. By this, we mean that there must be features of the intrinsic noise
such that, when additional classification noise is added, the resulting noise that
characterizes the data is similar to a classification noise, or at least, to a noise
that can be naturally handled by UMA. Finding out the family of noise processes
that can be combined with the classification noise (or, more generally, the family
of noise processes themselves) without hindering the effectiveness of UMA is one
research direction that we aim to explore in the near future.


5 Conclusion 

In this paper, we have proposed a new algorithm, UMA (for Unconfused Multiclass
Additive algorithm), to cope with noisy training examples in multiclass linear
problems. As its name indicates, it is a learning procedure that extends the
(ultraconservative) additive multiclass algorithms proposed by [10]; to handle
noisy datasets, it only requires the information about the confusion matrix that
characterizes the mislabelling process. This is, to the best of our knowledge, the
first time the confusion matrix is used as a way to handle noisy labels in multiclass
problems.

One of the core ideas behind UMA, namely, the computation of the update vector
$z_{pq}$, is not tied to the additive update scheme. Thus, as long as the assumption
of linear separability holds, the very same idea can be used to render a wide
variety of algorithms robust to noise by iteratively generating a noise-free training
set with the consecutive values of $z_{pq}$. However, every computation of a new
$z_{pq}$ requires learning a new classifier to start with. This may eventually incur
prohibitive computational costs when applied to batch methods (as opposed to
online methods), which are designed to process the entirety of the dataset at once.^

UMA takes advantage of the online scheme of additive algorithms and avoids
this problem completely. Moreover, additive algorithms are designed to directly
handle multiclass problems rather than having recourse to a bi-class mapping. The
end results of this are tightened theoretical guarantees and a convergence rate that
does not depend on Q, the number of classes. Besides, UMA can be directly used with
any additive algorithm, making it possible to handle noise with multiple methods without
further computational burden.

^ Nonetheless, from a purely theoretical point of view, UMA makes at most $O(1/\theta^2)$ mistakes
(see Proposition 4) and computing $z_{pq}$ can be done in $O(n)$ time. Therefore, polynomial batch
methods do not suffer much from this as their overall execution time is still polynomial.

While we provide a sample complexity analysis, it should be noted that a tighter
bound can be derived with specific multiclass tools, such as the Natarajan
dimension (see [12] for example), which allows one to better characterize the expressiveness of
a multiclass classifier. However, this is not the main focus of this paper and our
results are based on simpler tools.

To complement this work, we want to investigate a way to properly tackle
near-linear problems (such as Reuters). As of now, the algorithm already does a
very good job thanks to its noise robustness. However, more work has to be done
to derive a proper way to handle cases where a perfect classifier does not exist.
We think there are great avenues for interesting research in this domain with an
algorithm like UMA and we are curious to see how the present work may carry over
to more general problems.

A Double sample theorem

Proof (Proposition 3)

For a fixed pair $(p, q) \in \mathcal{Y}^2$, we consider the family of functions

\[ \mathcal{F}_{pq} = \{ f : f(x) = \langle w_p - w_q, x \rangle : w_p, w_q \in B^d \}, \]

where $B^d$ is the d-dimensional unit ball. For each $f \in \mathcal{F}_{pq}$ define the corresponding
“loss” function

\[ \ell^f(x) = \ell(f(x)) = 2 - f(x). \]

Strictly speaking, $\ell^f(x)$ is not a loss as it does not take y into account;
nonetheless, it plays the same role in the following proof as a regular loss in the
regular double-sampling proof. One way to think of it is as the loss of a problem
for which we do not care about the observed labels but instead want to classify
points into a predetermined class, in this case q.

Clearly, $\mathcal{F}_{pq}$ is a subspace of affine functions, thus $\mathrm{Pdim}(\mathcal{F}_{pq}) \le d + 1$, where
$\mathrm{Pdim}(\mathcal{F}_{pq})$ is the pseudo-dimension of $\mathcal{F}_{pq}$. Additionally, $\ell$ is Lipschitz in its first
argument with a Lipschitz factor of L = 1. Indeed, $\forall y, y_1, y_2 \in \mathcal{Y}$:

\[ |\ell(y_1, y) - \ell(y_2, y)| = |y_1 - y_2|. \]

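Let $\mathcal{D}_{pq}$ be any distribution over $\mathcal{X} \times \mathcal{Y}$ and $T \in (\mathcal{X} \times \mathcal{Y})^m$ such that $T \sim \mathcal{D}_{pq}^m$,
then define the empirical loss $\mathrm{err}^{\ell}_{T}[f] = \frac{1}{m} \sum_{(x_i, y_i) \in T} \ell(f(x_i), y_i)$ and the expected loss
$\mathrm{err}^{\ell}_{\mathcal{D}}[f] = \mathbb{E}_{(x, y) \sim \mathcal{D}_{pq}}[\ell(f(x), y)]$.

The goal here is to prove that

\[ \mathbb{P}_{T \sim \mathcal{D}_{pq}^m}\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{\mathcal{D}}[f] - \mathrm{err}^{\ell}_{T}[f] \right| \ge \epsilon \right) \le O\left( \left(\frac{8}{\epsilon}\right)^{d+1} e^{-m\epsilon^2/128} \right) \tag{21} \]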

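Proof (Proof of (21)) We start by noting that $\ell(y_1, y_2) \in [0, 2]$ and then proceed
with a classic 4-step double sampling proof. Namely:

Symmetrization. We introduce a ghost sample $T' \in (\mathcal{X} \times \mathcal{Y})^m$, $T' \sim \mathcal{D}_{pq}^m$, and
show that for $f_T^{\mathrm{bad}}$ such that $|\mathrm{err}^{\ell}_{T}[f_T^{\mathrm{bad}}] - \mathrm{err}^{\ell}_{\mathcal{D}}[f_T^{\mathrm{bad}}]| \ge \epsilon$ we have

\[ \mathbb{P}_{T' \mid T}\left( \left| \mathrm{err}^{\ell}_{T'}[f_T^{\mathrm{bad}}] - \mathrm{err}^{\ell}_{\mathcal{D}}[f_T^{\mathrm{bad}}] \right| \le \frac{\epsilon}{2} \right) \ge \frac{1}{2} \]

as long as $m\epsilon^2 \ge 32$.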

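It follows that

\[ \begin{aligned}
\mathbb{P}_{(T,T') \sim \mathcal{D}_{pq}^m \times \mathcal{D}_{pq}^m}&\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{T}[f] - \mathrm{err}^{\ell}_{T'}[f] \right| \ge \frac{\epsilon}{2} \right) \\
&\ge \mathbb{P}_{T \sim \mathcal{D}_{pq}^m}\left( \left| \mathrm{err}^{\ell}_{T}[f_T^{\mathrm{bad}}] - \mathrm{err}^{\ell}_{\mathcal{D}}[f_T^{\mathrm{bad}}] \right| \ge \epsilon \right) \times \mathbb{P}_{T' \mid T}\left( \left| \mathrm{err}^{\ell}_{T'}[f_T^{\mathrm{bad}}] - \mathrm{err}^{\ell}_{\mathcal{D}}[f_T^{\mathrm{bad}}] \right| \le \frac{\epsilon}{2} \right) \\
&\ge \frac{1}{2}\, \mathbb{P}_{T \sim \mathcal{D}_{pq}^m}\left( \left| \mathrm{err}^{\ell}_{T}[f_T^{\mathrm{bad}}] - \mathrm{err}^{\ell}_{\mathcal{D}}[f_T^{\mathrm{bad}}] \right| \ge \epsilon \right) \\
&= \frac{1}{2}\, \mathbb{P}_{T \sim \mathcal{D}_{pq}^m}\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{T}[f] - \mathrm{err}^{\ell}_{\mathcal{D}}[f] \right| \ge \epsilon \right) \quad \text{(by definition of } f_T^{\mathrm{bad}}\text{)}
\end{aligned} \]

Thus upper bounding the desired probability by

\[ 2 \times \mathbb{P}_{(T,T') \sim \mathcal{D}_{pq}^m \times \mathcal{D}_{pq}^m}\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{T}[f] - \mathrm{err}^{\ell}_{T'}[f] \right| \ge \frac{\epsilon}{2} \right) \tag{22} \]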


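Swapping Permutations. Define $\Gamma_m$ as the set of all permutations that swap one
or more elements of T with the corresponding element of T' (i.e. the ith element
of T is swapped with the ith element of T'). It is immediate that $|\Gamma_m| = 2^m$.
For each permutation $\sigma \in \Gamma_m$ we denote by $\sigma(T)$ (resp. $\sigma(T')$) the set originating from T
(resp. T') in which the elements have been swapped with T' (resp. T) according
to $\sigma$.

Thanks to $\Gamma_m$ we will be able to provide an upper bound on (22). Our starting
point is that, since $(T, T') \sim \mathcal{D}_{pq}^m \times \mathcal{D}_{pq}^m$, for any $\sigma \in \Gamma_m$ the random variable
$\sup_{f \in \mathcal{F}_{pq}} |\mathrm{err}^{\ell}_{T}[f] - \mathrm{err}^{\ell}_{T'}[f]|$ follows the same distribution as
$\sup_{f \in \mathcal{F}_{pq}} |\mathrm{err}^{\ell}_{\sigma(T)}[f] - \mathrm{err}^{\ell}_{\sigma(T')}[f]|$.

Therefore:

\[ \begin{aligned}
\mathbb{P}_{(T,T')}&\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{T}[f] - \mathrm{err}^{\ell}_{T'}[f] \right| \ge \frac{\epsilon}{2} \right) \\
&= \frac{1}{2^m} \sum_{\sigma \in \Gamma_m} \mathbb{P}_{(T,T')}\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{\sigma(T)}[f] - \mathrm{err}^{\ell}_{\sigma(T')}[f] \right| \ge \frac{\epsilon}{2} \right) \\
&= \mathbb{E}_{(T,T')}\left[ \frac{1}{2^m} \sum_{\sigma \in \Gamma_m} \mathbb{1}\left\{ \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{\sigma(T)}[f] - \mathrm{err}^{\ell}_{\sigma(T')}[f] \right| \ge \frac{\epsilon}{2} \right\} \right] \\
&\le \sup_{(T,T') \in (\mathcal{X} \times \mathcal{Y})^{2m}} \mathbb{P}_{\sigma \in \Gamma_m}\left( \sup_{f \in \mathcal{F}_{pq}} \left| \mathrm{err}^{\ell}_{\sigma(T)}[f] - \mathrm{err}^{\ell}_{\sigma(T')}[f] \right| \ge \frac{\epsilon}{2} \right)
\end{aligned} \tag{23} \]

which concludes the second step.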


Reduction to a finite class. The idea is to reduce J-pq in (23) to a hnite class 
of functions. For the sake of conciseness, we will not enter into the details of the 
theory of covering numbers. Please refer to the corresponding literature for further 
details {e.g. [14]). 

In the following, $\mathcal{N}(\epsilon/8, \mathcal{F}_{pq}, 2m)$ will denote the uniform $\epsilon/8$ covering number
of $\mathcal{F}_{pq}$ over a sample of size $2m$.

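Let us define $\mathcal{G}_{pq} \subset \mathcal{F}_{pq}$ such that $(\ell^{\mathcal{G}_{pq}})|_{(T,T')}$ is an $\epsilon/8$-cover of $(\ell^{\mathcal{F}_{pq}})|_{(T,T')}$.
Thus, $|\mathcal{G}_{pq}| \le \mathcal{N}(\epsilon/8, \ell^{\mathcal{F}_{pq}}, 2m) < \infty$. Therefore, if $\exists f \in \mathcal{F}_{pq}$ such that
$|\mathrm{err}^{\ell}_{\sigma(T)}[f] - \mathrm{err}^{\ell}_{\sigma(T')}[f]| \ge \frac{\epsilon}{2}$, then $\exists g \in \mathcal{G}_{pq}$ such that
$|\mathrm{err}^{\ell}_{\sigma(T)}[g] - \mathrm{err}^{\ell}_{\sigma(T')}[g]| \ge \frac{\epsilon}{4}$, and the
following comes naturally: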


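\[
\begin{aligned}
&P_{\sigma \in \Gamma_m}\left( \sup_{f \in \mathcal{F}_{pq}} \left|\mathrm{err}^{\ell}_{\sigma(T)}[f] - \mathrm{err}^{\ell}_{\sigma(T')}[f]\right| \ge \frac{\epsilon}{2} \right) \\
&\quad \le P_{\sigma \in \Gamma_m}\left( \max_{g \in \mathcal{G}_{pq}} \left|\mathrm{err}^{\ell}_{\sigma(T)}[g] - \mathrm{err}^{\ell}_{\sigma(T')}[g]\right| \ge \frac{\epsilon}{4} \right) \\
&\quad \le \mathcal{N}(\epsilon/8, \ell^{\mathcal{F}_{pq}}, 2m) \max_{g \in \mathcal{G}_{pq}} P_{\sigma \in \Gamma_m}\left( \left|\mathrm{err}^{\ell}_{\sigma(T)}[g] - \mathrm{err}^{\ell}_{\sigma(T')}[g]\right| \ge \frac{\epsilon}{4} \right) \qquad \text{(union bound)}
\end{aligned}
\]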

Hoeffding's inequality. Finally, consider $\mathrm{err}^{\ell}_{\sigma(T)}[g] - \mathrm{err}^{\ell}_{\sigma(T')}[g]$ as the average
of $m$ realizations of the same random variable, with expectation equal to 0. Then
by Hoeffding's inequality we have that^


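\[
P_{\sigma \in \Gamma_m}\left( \left|\mathrm{err}^{\ell}_{\sigma(T)}[g] - \mathrm{err}^{\ell}_{\sigma(T')}[g]\right| \ge \frac{\epsilon}{4} \right) \le 2e^{[\ldots]} \tag{24}
\]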

Putting everything together yields the result w.r.t. $\mathcal{N}(\epsilon/8, \ell^{\mathcal{F}_{pq}}, 2m)$ for $m\epsilon^2 \ge 32$.
For $m\epsilon^2 < 32$ it holds trivially.
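The Hoeffding step can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the paper's computation: for a fixed function, swapping the two samples at random amounts to attaching a uniform random sign to each per-example loss difference, and since the loss lies in [0, 2] each difference is at most 2 in absolute value. The values in `a` and the threshold `t` are hypothetical, and the bound used is the generic Hoeffding bound for sums of random signs, not the constant of (24).

```python
from itertools import product
from math import exp


def exact_sign_tail(a, t):
    """Exact P(|(1/m) * sum_i s_i * a_i| >= t) for uniform signs s_i in {-1, +1}."""
    m = len(a)
    hits = sum(
        1
        for signs in product((-1, 1), repeat=m)
        if abs(sum(s * x for s, x in zip(signs, a))) / m >= t
    )
    return hits / 2 ** m


def hoeffding_sign_bound(a, t):
    """Hoeffding bound 2 * exp(-(m * t)^2 / (2 * sum_i a_i^2)) for the same event."""
    m = len(a)
    return 2.0 * exp(-((m * t) ** 2) / (2.0 * sum(x * x for x in a)))


# Hypothetical per-example loss differences, each within [-2, 2].
a = [2.0, -1.5, 0.5, 2.0, -2.0, 1.0, 0.0, -0.5, 1.5, 2.0]
t = 0.75
exact = exact_sign_tail(a, t)
bound = hoeffding_sign_bound(a, t)
```

The exact tail is obtained by enumerating all 2^10 sign patterns and never exceeds the Hoeffding bound, which is the property the proof relies on at this step.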

Recall that $\ell$ is Lipschitz in its first argument with a Lipschitz constant
$L = 1$, thus $\mathcal{N}(\epsilon/8, \ell^{\mathcal{F}_{pq}}, 2m) \le \mathcal{N}(\epsilon/8, \mathcal{F}_{pq}, 2m) = O([\ldots])$


^ Note that in some references the right-hand side of (24) might be viewed as a probability
measure over $m$ independent Rademacher variables.
The last part of the proof comes from the observation that, for any fixed $(p, q)$,
we never used any specific information about $\mathcal{F}_{pq}$ other than the upper
bound of $d + 1$ on its pseudo-dimension. In other words, equation (21) holds for a
slightly modified definition of $\mathcal{F}_{pq}$ as long as the pseudo-dimension does not exceed
$d + 1$.

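Let us now consider:

\[
\tilde{\mathcal{F}}_{pq} = \left\{ f : f(x) = \mathbb{1}\{t(x) = q\}\, \mathbb{1}\{x \in \mathcal{A}_p\}\, \langle w_p - w_q, x \rangle : w_p, w_q \in B^d \right\}
\]

Clearly, for each function in $\tilde{\mathcal{F}}_{pq}$ there is at most one corresponding affine
function, thus $\mathcal{F}_{pq}$ and $\tilde{\mathcal{F}}_{pq}$ share the same upper bound of $d + 1$ on their
pseudo-dimension.

Consequently, any covering number of $\mathcal{F}_{pq}$ is also a covering number of $\tilde{\mathcal{F}}_{pq}$.
More precisely, this proof holds true for any $w_p$ and $w_q$, independently of $\mathcal{A}_p$,
which may itself be defined with respect to $w_p$ and $w_q$.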

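It comes naturally that, fixing $S$ as the training set, the following holds true:

\[
\frac{1}{m} \sum_{x \in S} \mathbb{1}\{t(x) = q\}\, \mathbb{1}\{x \in \mathcal{A}_p\}\, x = z_{pq}.
\]

Thus

\[
\mathrm{err}^{\ell}_{T}[f] - \mathrm{err}^{\ell}_{\mathcal{D}}[f] \; [\ldots]
\]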

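We can generalize this result for any couple $(p, q)$ by a simple union bound,
giving the desired inequality:

\[
P_{(\mathcal{X} \times \mathcal{Y}) \sim \mathcal{D}}\left( \sup \; [\ldots] \left\langle \frac{w_p - w_q}{\|w_p - w_q\|}, \; [\ldots] \right\rangle [\ldots] \right) \le O\left( [\ldots] \right)
\]

Equivalently, we have that

\[
[\ldots] \ge \epsilon
\]

with probability $1 - \delta$ for

\[
m \in O\left( [\ldots] \ln [\ldots] + \ln(Q) + d \ln [\ldots] \right)
\]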

















